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Poisson Matrix Recovery and Completion 

Yang Cao, Yao Xie 


Abstract —We extend the theory of low-rank matrix recovery 
and completion to the case when Poisson observations for a linear 
combination or a subset of the entries of a matrix are available, 
which arises in various applications with count data. We consider 
the usual matrix recovery formulation through maximum likeli¬ 
hood with proper constraints on the matrix M of size di-by-c? 2 , 
and establish theoretical upper and lower bounds on the recovery 
error. Our bounds for matrix completion are nearly optimal up to 
a factor on the order of C>(log(ciici2)). These bounds are obtained 
by combing techniques for compressed sensing for sparse vectors 
with Poisson noise and for analyzing low-rank matrices, as well 
as adapting the arguments used for one-bit matrix completion 
[1] (although these two problems are different in nature) and the 
adaptation requires new techniques exploiting properties of the 
Poisson likelihood function and tackling the difficulties posed by 
the locally sub-Gaussian characteristic of the Poisson distribution. 
Our results highlight a few important distinctions of the Poisson 
case compared to the prior work including having to impose 
a minimum signal-to-noise requirement on each observed entry 
and a gap in the upper and lower bounds. We also develop a 
set of efficient iterative algorithms and demonstrate their good 
performance on synthetic examples and real data. 

Index Terms —low-rank matrix recovery, matrix completion, 
Poisson noise, estimation, information theoretic bounds 


1. Introduction 

Recovering a low-rank matrix M with Poisson observations 
is a key problem that arises from various real-world applications 
with count data, such as nuclear medicine, low-dose x-ray 
imaging [2], network traffic analysis [3], and call center data [4]. 
There the observations are Poisson counts whose intensities are 
determined by the matrix, either through a subset of its entries 
or linear combinations of its entries. 

Thus far much success has been achieved in solving the 
matrix completion and recovery problems using nuclear norm 
minimization, partly inspired by the theory of compressed 
sensing [5], [6]. It has been shown that when M is low rank, 
it can be recovered from observations of a subset or a linear 
combination of its entries (see, e.g. [7]-[15]). Earlier work on 
matrix completion typically assume that the observations are 
noiseless, i.e., we may directly observe a subset of entries of M. 
In the real world, however, the observations are noisy, which 
is the focus of the subsequent work [16]-[21], most of which 
consider a scenario when the observations are contaminated by 
Gaussian noise. The theory for low-rank matrix recovery under 
Poisson noise has been less developed. Moreover, the Poisson 
problems are quite different from their Gaussian counterpart, 
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since under Poisson noise the variance of the noisy observations 
is proportional to the signal intensity. Moreover, instead of using 
£2 error for data fit, we need to use a highly non-linear likelihood 
function. 

Recently there has also been work that consider the more 
general noise models, including noisy 1-bit observations [1], 
which may be viewed as a case where the observations are 
Bernoulli random variables whose parameters depend on a un¬ 
derlying low-rank matrix; [22], [23] consider the case where all 
entries of the low-rank matrix are observed and the observations 
are Poisson counts of the entries of the underlying matrix, and 
an upper bound is established (without a lower bound). In the 
compressed sensing literature, there is a line of research for 
sparse signal recovery in the presence of Poisson noise [24]- 
[26] and the corresponding performance bounds. The recently 
developed SCOPT [27], [28] algorithm can also be used to solve 
the Poisson compressed sensing of sparse signals but may not 
be directly applied for Poisson matrix recovery. 

In this paper, we extend the theory of low-rank matrix 
recovery to two related problems with Poisson observations: 
matrix recovery from compressive measurements, and matrix 
completion from observations of a subset of its entries. The 
matrix recovery problem from compressive measurements is 
formulated as a regularized maximum likelihood estimator 
with Poisson likelihood. We establish performance bounds by 
combining techniques for recovering sparse signals under Poisson 
noise [24] and for establishing bounds in the case of low-rank 
matrices [29], [30]. Our results demonstrate that as the intensity 
of the signal increases, the upper bound on the normalized error 
decays at certain rate depending how well the matrix can be 
approximated by a low-rank matrix. 

The matrix completion problem from partial observations 
is formulated as a maximum likelihood problem with proper 
constraints on the matrix M (nuclear norm bound ||M||* < 
as/rdid 2 for some constant a and bounded entries /3 < 
Mij < a)*. We also establish upper and lower bounds on the 
recovery error, by adapting the arguments used for one-bit matrix 
completion [1]. The upper and lower bounds nearly match up 
to a factor on the order of O{\og{did 2 )), which shows that the 
convex relaxation formulation for Poisson matrix completion 
is nearly optimal. We conjecture that such a gap is inherent to 
the Poisson problem In the sense that It may not be an artifact 
due to our proof techniques. Moreover, we also highlight a few 
important distinctions of Poisson matrix completion compared to 
the prior work on matrix completion in the absence of noise and 

’Note that the formulation differs from the one-bit matrix completion case 
in that we also require a lower bound on each entry of the matrix. This is 
consistent with an intuition that the value of each entry can be viewed as the 
signal-to-noise ratio (SNR) for a Poisson observation, and hence this essentially 
poses a requirement for the minimum SNR. 



with Gaussian noise: (1) Although our arguments are adapted 
from one-bit matrix completion (where the upper and lower 
bounds nearly match), in the Poisson case there will be a gap 
between the upper and lower bounds, possibly due to the fact that 
Poisson distribution is only locally sub-Gaussian. In our proof, 
we notice that the arguments based on bounding all moments 
of the observations, which usually generate tight bounds for 
prior results with sub-Gaussian observations, do not generate 
tight bounds here; (2) We will need a lower bound on each 
matrix entry in the maximum likelihood formulation, which can 
be viewed as a requirement for the lowest signal-to-noise ratio 
(since the signal-to-noise ratio (SNR) of a Poisson observation 
with intensity / is ^/l). 

We also present a set of efficient algorithms, which can be used 
for both matrix recovery based on compressive measurements 
or based on partial observations. These algorithms include 
two generic (gradient decent based) algorithms: the proximal 
and accelerated proximal gradient descent methods, and an 
algorithm tailored to Poisson problems called the Penalized 
Maximum Likelihood Singular Value Threshold (PMLSVT) 
method. PMLSVT is derived by expanding the likelihood 
function locally in each iteration, and hnding an exact solution 
to the local approximation problem which results in a simple 
singular value thresholding procedure [13]. The performance of 
the two generic algorithms are analyzed theoretically. PMLSVT 
is related to [31]-[33] and can be viewed as a special case where 
a simple closed form solution for the algorithm exists. Good 
performance of the PMLSVT is demonstrated with synthetic and 
real data including solar flare images and bike sharing count data. 
We show that the PMLSVT method has much lower complexity 
than solving the problem directly via semidehnite program and 
it has fairly good accuracy. 

While working on this paper we realize a parallel work [34] 
which also studies performance bounds for low rank matrix 
completion with exponential family noise and using a different 
approach for proof (Poisson noise is a special case of theirs). 
Their upper bound for the mean square error (MSB) is on the 
order of O (log((ii -I- (( 2 )^ maxjdi, d 2 }/m) (our upper bound 
is O (log((ii(i 2 )[?'(<^i + <^ 2 )/w]^/^)), and their lower bound is 
on the order of O (rmax{(ii,(i 2 }/w) (versus our lower bound 
is O -I- d 2 )/raY^‘^'). There might be two reasons for the 

difference. First, our sampling model (consistent with one bit 
matrix completion in [1]) assumes sampling without replacement, 
therefore are at most did 2 observations, and each entry may be 
observed at most once. In contrast, [34] assumes sampling with 
replacement, therefore there can be multiple observations for the 
same entry. Since our result heavily depends on the sampling 
model, we suspect this may be a main reason for the difference. 
Another possible reason could be due to different formulations. 
The formulation for matrix completion in our paper is a 
constrained optimization with an exact upper bound on the matrix 
nuclear norm, whereas [34] uses a regularized optimization with 
a regularization parameter A (which is indirectly related to the 
nuclear norm of the solution), but there is no direct control of 
the matrix nuclear norm. Also, note that their upper and lower 
bounds also has a gap on the order of log((ii + d 2 ), which is 
consistent with our result. On the other hand, compared with 


the more general framework for M-estimator [35], our results 
are specihc to the Poisson case, which may possibly be stronger 
but do not apply generally. 

The rest of the paper is organized as follows. Section II sets up 
the formalism for Poisson matrix completion. Section III presents 
matrix recovery based on constrained maximum likelihood 
and establishes the upper and lower bounds for the recovery 
accuracy. Section IV presents the PMLSVT algorithm that solves 
the maximum likelihood approximately and demonstrates its 
performance on recovering solar flare images and bike sharing 
count data. All proofs are delegated to the appendix. 

The notation in this paper is standard. In particular, IR+ denotes 
the set of positive real numbers and Z™ denotes a m-dimensional 
vector with positive integer entries; |(i] ={1,2,..., d}; (x)'*' = 
maxja;, 0} for any scalar x; Let [x\j denote the jth element of 
a vector cc; I{[£]} is the indicator function for an event e; |A| 
denotes the number of elements in a set A; diagja;} denotes a 
diagonal matrix with entries of a vector x being its diagonal 
entries; ldixd 2 denotes an (ii-by-d 2 matrix of all ones. Let 
||a;||i, ||a ;||2 denote the £i and £2 norms of a vector x. Let 
entries of a matrix X be denoted by Xij or [X]ij. For a matrix 
X = [xi,... ,Xn] with Xj being the jth column, vec(X) = 
[xj ,..., xJ^Y denote vectorized matrix. Let |j AH be the spectral 
norm which is the largest absolute singular value, ||A|ji^ = 
(Si j be the Frobenius norm, ||A|j* be the nuclear norm 

which is the sum of the singular values, ||A||i_i = \^ij\ 

be the £i norm, and hnally ||A|joo = max^ \Xij \ be the infinity 
norm of the matrix. Let rank(A) denote the rank of a matrix X. 
We say that a random variable Z follows the Poisson distribution 
with a parameter A (or Z ^ Poisson(A)) if its probability mass 
function P(Z = k) = e“^A^/(fc!)). Finally, let KHZ'] denote the 
expectation of a random variable Z. 

The only set of non-conventional notation that we use is the 
following. By a slight abuse of notation, we denote the Kullback- 
Leibler (KL) divergence between two Poisson distributions with 
parameters p and q, p,q € IR_|_ as 

Dip\\q) =p\og{p/q) - (p-q), 


and denote the Hellinger distance between two Poisson distribu¬ 
tions with parameters p and q as 

dH{p,q) = 2 - 2exp|-i {y^-y/qf 

It should be understood that the KL distance and the Hellinger 
distance are defined between two distributions and here the 
arguments p and q are merely parameters of the Poisson 
distributions since we restrict our attention to Poisson. Based on 
this, we also denote, by a slight abuse of notation, the average 
KL and Hellinger distances for two sets of Poisson distributions 

whose parameters are determined by entries of two matrices P, 

Q (Z Rdrxd2. 
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d%{P,Q) = 
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dH{Pi],Qij) 
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II. Formulation 


A. Matrix recovery 

Given a matrix M G consisting of positive entries, 

we obtain m Poisson measnrements y = [yi,... G 

that take the forms of 


^ Poisson(tr(A^M)), i = l,...,TO. (I) 

where Ai G and it models the measnring process of 

physical devices, e.g., the compressive imaging system given 
in [2]. For example, Ai can be interpreted as a mask imposed 
on the scene, and one measurement is made by integrating the 
total photon intensity passing through the mask. Our goal is to 
estimate the signal M G from measnrements y G Z™. 

Define a linear operator A : —>• such that 

[AM],^tx{AlM). ( 2 ) 


So by defining 

vec(Ai)''' 

: 

vec(^m)T 


/ = vec(M), 


we can write 

AM = Af. 


( 3 ) 


The following assumptions are made for matrix recovery. First, 
the total Intensity of M given by 

is known a priori. Second, assume an entry-wise lower bonnd 
\M]jk > c for some constant c > 0. This prevents the degeneracy 
later on since the rate of a Poisson random variable has to be 
positive. Third, motivated by the assumptions made in Poisson 
compressed sensing [24] and to have physically realizable optical 
systems, we assnme that A satisfies the following constraints: 

(1) positivity-preserving: for any nonnegative inpnt matrix M, 
the measurements must also be nonnegative; equivalently, 

Mij > 0 for all i,j => > 0 for all i; 

(2) flux-preserving: the mean total intensity of the observed signal 
must not exceed to total Intensity incident upon the system: 

m 

i=l 

Physically, this means the photon connting measnrements can 
only be positive and we cannot measnre more light than what 
is available. 

We consider a regularized maximum-likelihood formulation. 
In the matrix recovery problem, the log-Iikelihood fnnction is 
given by 

m 

LaAX) = log[-4X], - [AX],} - Apen(W), (4) 

i=l 

where the subscript A and y indicate the given data in defining 
the log likelihood function. Based on the previous assumptions. 


we define a countable candidate set 

r^{X,G : IIW,II 1,1 = I, [X,]jk > c,i = 1, 2,...}, (5) 

for some constant c > 0. This F can be interpreted as a 
discretized feasible domain of the general problem. Note that 
the true matrix M gT. Also introduce a regularization function 
that satisfies the Kraft inequality [24], [25], 

^ g-pen(X) < 
xgt 

A Kraft-compliant regularization fnnction typically assigns a 
small value for a lower rank matrix X and assigns a large value 
for a higher rank matrix X. Using Kraft-compliant regularization 
to prefix codes for estimators is a commonly nsed technique 
in constmcting estimators [24]. An estimator M is obtained by 
solving the following convex optimization problem: 

M = arg max U_ 4 ,y(W). {matrix recovery} (7) 
xer 

Remark 1 (Relation to Poisson compressed sensing). In Poisson 
compressed sensing [25], the measurement vector is given by 
y ~ Poisson{A'^x), where A is a sensing matrix and x is a 
sparse vector of interest. From (3), we see that the measurement 
model of Poisson matrix recovery can be written in this form too. 
However, if we vectorize M and solve it as a Poisson compressed 
sensing problem (3), the low-rank structure of M will be lost. 
Hence, the extension to Poisson matrix recovery is important, 
since in many applications such as compressive imaging the 
signal is well modeled as a nearly low-rank matrix. 


B. Matrix completion 

A related problem is matrix completion. Given a matrix 
M G consisting of positive entries, we obtain noisy 

observations for a snbset of its entries on an index set 
n C |c?i] X |(i2l- The indices are randomly selected with 
E[|0|] = m. In other words, I{(i, j) G U} are i.i.d. Bernonlli 
random variables with parameter m/{did 2 ). The observations 
are Poisson counts of the observed matrix entries and they are 
mutually Independent 

Fy ^ Poisson(My), V(i, j) G U. (8) 

Onr goal is to recover M from the Poisson observations 

The following assumptions are made for the matrix completion 
problem. First, we set an npper bonnd a > 0 for each entry 
Mij < a to entail that the recovery problem is well-posed [19]. 
This assumption is also reasonable in practice; for instance, M 
may represent an image which is nsually not too spiky. The 
second assumption is characteristic to Poisson matrix completion: 
we set a lower bound /3 > 0 for each entry Mij > [3. This entry- 
wise lower bound is required by our later analysis (so that the 
cost function is Lipschitz), and it also has an interpretation of 
a minimum required signal-to-noise ratio (SNR), since SNR of 
a Poisson observation with intensity I is given by s/l. Third, 
we make a similar assumption to one-bit matrix completion [1]; 
the nnclear norm of M is npper bonnded ||M||* < a\/rdid 2 . 
This is a relaxation of the assumption that M has a rank exactly 
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r (some small integer). This particular choice arises from the 
following consideration. If < a and rank(M) < r, then 

ll^^ll* < V^II^IIf < \/rdi(k\\M\\ac < a\/rdid2- 

We consider a formulation by maximizing the log-likelihood 
function of the optimization variable X given our observations 
subject to a set of convex constraints. In the matrix completion 
problem, the log-likelihood function is given by 

Fn,YiX)= Y,j\ogX,j-X,j, (9) 

where the subscript fl and Y indicate the data involved in the 
maximum likelihood function F. Based on previous assumptions, 
we dehne a candidate set 

5^ : ||X|U 

P < X,j < a,y{i,j) e {dll X Idal} • 

An estimator M can be obtained by solving the following convex 
optimization problem: 

M = argmax Fq y{X). {matrix completion} (11) 
xes 


C. Relation of two formulations 

Note that the matrix completion problem can also be formu¬ 
lated as a regularized maximum likelihood function problem 
similar to (7). However, we consider the current formulation for 
the convenience of drawing connections, respectively, between 
Poisson matrix recovery and Poisson compressed sensing studied 
in [24], as well as Poisson matrix completion and one-bit matrix 
completion studied in [1]. 

Indeed, these two formulations in the forms of (7) and (11) 
are related by the well-known duality theory in optimization (see, 
e.g. [36]). Consider a penalized convex optimization problem: 

mmf{x) + Xg{x), A > 0, (12) 

X 

and the constrained convex optimization problem: 

min/(a;) subject to g{x) < c. (13) 

X 

Denote x* as the solution to (12). Then x* is also the solution 
to (13), if we set c = g{x*). Conversely, denote x* as the 
solution to problem (13). We can interpret A > 0 as the the 
Lagrange multiplier and consider the Lagrange dual problem. 
Under Slater’s condition (i.e. there exists at least one x such 
that g{x) < c), there is at least one A such that x* is also the 
solution to problem (12). Therefore, (12) and (13) are equivalent 
in the sense that the two problems have the same minimizer for 
properly chosen parameters. More details can be found in [37]. 
Using the suggestion by Theorem 1 in [37], we choose A in (7) 
to be around 1/as/rdidf. 


III. Performance Bounds 


In the following, we use the squared error 

R{M,M) 4 \\M-M\\l, 


as a performance metric for both matrix recovery and matrix 
completion problems. 


A. Matrix recovery 

To extend the upper bounds in Poisson compressed sensing 
[24] to the matrix recovery setting, we introduce a class of nearly 
low-rank matrices whose singular values decay geometrically. 
Using a particular choice for the sensing matrices Ai that satisfy 
certain property (Lemma 1), we present a performance guarantee 
for the estimator in the general setting (Lemma 2) and then for 
nearly low-rank matrices (Theorem 1). 

1) Sensing operator: Let Zi, i = 1,... ,m denote a (ii-by-(i 2 
matrix with entries i.i.d. follow the distribution 

/,_ n1/2 

( 1 , with probability p; 

\ 1/2 
1-pi 


[Zi]jk — 


(15) 


) , with probability 1 — p. 


Dehne 


A, = Z, 


which consists of a random part and a deterministic part: 


A, = 


m 


1 — p 

Ai ~\~ 

m 


(16) 


This construction is inspired by [24] for the purpose of 
corresponding to a feasible physical system. In particular, every 
element of Ai is nonnegative and scaled properly, so that the 
measured photon intensity is no greater than the photon intensity 
of the original signal. It can be verihed that Ai and the associated 
A satisfy the requirements in the previous section. In particular, 
(1) all entries of Ai take values of 0 or l/m; (2) A satishes hux 
preserving: for any matrix X with positive entries [X]ij > 0, 
since all entries of Ai are less than 1/m, 


d-i do 


m di d 2 


i—l j=l k—1 




(3) with probability at least 1 — every matrix Ai has 

at least one non-zero entry. It follows that for a matrix X such 
that [X]ij > c, under the event described above, we have 


d\ d2 d\ d2 

c'^'^[Ai\jk > (17) 

j—1 k—1 j—1 k—1 


This prevents degeneracy since is a rate of Poisson and 

has to be positive. 

The random part of the operator A has certain restrictive 
isometry property (RIP) similar to [5], as formalized in the 
following lemma and proved by extending Theorem 1 of [24]. 
Similar to [24], Lemma 1 is used in proving the performance 
upper bound in a general setting (Lemma 2). 


Lemma 1 (RIP of operator A). Consider the operator A defined 
by AX = [tr{AiX),.. .,tr{A^X)]^ € M™. For all Xi,X 2 G 
^here A {X G Rd^xd 2 . ||X||i,i = 1}, there 

exist absolute constants ci, C 2 > 0 such that the bound 


\\Xi - X2fp < 4\\AXi - AX2f2 


2cigplog(c2gpdi(i2/TO) 


(14) 
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m 



holds with probability at least 1 — e 

o li /2 


£ = 


2p(l-p) 


1 , 


vv/zere 

ifp ^ 1/2; 

otherwise. 


(18) 


Moreover, there exist absolute constants 03,04 > 0 such that 
for any finite set T C 5^1 where A {X G 

jg^dixdz . ||x||p’ = 1} is the unit sphere, if m > 04^^ log2 \T\, 
then 

^<\\AX\\l<-, forallXGT 
holds with probability at least 1 — 


This lemma follows directly from applying Theorem 1 in [24] 
to the vectorized version of the matrix recovery problem (3) 
as well as using the fact || 2 f||F = ||vec( 2 f )||2 and |j 2 f|ji^i = 
||vec(2f)|ji for a matrix X. 

2) General matrices: For an arbitrary matrix M, given a 
suitable candidate set F, and if the operator A satisfies the RIP 
in Lemma 1, we may obtain a regret bound for the estimator M 
obtained from (7). Note that the result does not require M to 
be nearly low-rank. The result takes a similar form as Theorem 
2 of [24] except that the vector signal dimension is replaced 
by did 2 . Lemma 2 is used for establishing the regret bound for 
nearly low-rank matrices in Theorem 1. 


Lemma 2 (Regret bound). Assume the candidate set F defined 
in (5). Let Q be the collection of all subsets Fq C F, such that 
IFqI < for defined in (18). Then with probability 

at least 1 — did 2 e~^'^ for some positive constant k depending 
on Cl, C 3 and p: 


—K[R{M,M)] < Cm,,p min jnin 
I ’ Toes Mere 

‘^■cl£,p\og(c2£,pdid2/m) 

A , 

m 


R{M, M) pen{M) 

--f - 


(19) 


where Cm.p — max | p(i-p) | trt, and the expectation is 
taken with respect to the random Poisson measurements pi ~ 
Poisson{tr{AiM)) for a fixed A 


Remark 2. This bound can be viewed as an “oracle inequality”: 
if given the knowledge of the true matrix M, we may first obtain 
an “oracle error”, the term inside the square bracket in (19), 
which is the error associated with the best approximation of M 
in the set Fq, for all such possible sets Fq C F that the size of 
Fq is at most 0(2™). Then the normalized expected squared 
error of the maximum likelihood estimator is within a constant 
factor from this oracle error. 


3) Nearly low-rank matrices: In the following, we establish 
the regret bound for nearly low-rank matrices, whose singular 
values decay geometrically. Theorem 1 is obtained by extending 
Theorem 3 in [24], with key steps including realizing that the 
vector formed by the singular values of the nearly low-rank 
matrix is “compressible”, and invoking a covering number for 
certain subset of low-rank matrices in [29]. 


To extend the definition of compressible signals [5], [24] in the 
matrix setting, we consider a family of nearly low-rank matrices 
in the following sense. Assume the singular value decomposition 
of a matrix X G IR^i ^'^2 given by AT = UAi&g{9}V, where 
9 = [9i,... ,9dA is a vector consists of singular values, and 
l^*!! > \d 2 \ >■■■> l^dl with 

d = min{(ii, (i 2 }- 

Suppose there exists 0 < g < 00 , 0 < p < 00 , such that 

\0j\<Ql3-^/\ j = f...,d. ( 20 ) 

Any 9 satisfying (20) is said to belong to the weak-f^ ball of 
radius gl. It can be verified that the condition (20) is equivalent 
to the following 

\{j G M : \9,\ > cl}\ < (p/c)« 

holding for all c > 0. Hence, if we truncate the singular values 
of a nearly low-rank matrix less than cl, then the rank of the 
resulted matrix is (p/c)*. In other words, q controls the speed 
of the decay: the smaller the q, the faster the decay, and the 
smaller the rank of the approximating matrix by thresholding 
the small singular values. We shall focus primarily on the case 
0 < g < 1. Also, for ||Ar||i_i = I, 

d \ 

=ii^iif< iixiii,i = j, 

so we can take g to be a constant independent of I or d. That 
is also the reason in ( 20 ) for us to choose the constant to be 
gl since g has the meaning of a factor relatively to the total 
intensity I. 



For these nearly low-rank matrices, the best rank-£ approxi¬ 
mation to X can be constructed as 

= C/diag{ 6 i(^)}yT^ 

where = [9i, ..., 0^, 0,..., 0]"''. Note that 
||X - = ||C/diag {0 - 

= < i^cog^r^°^', 

for some constant cq > 0 that depends only on g, and a' = 
(1/g — 1/2) [5]. For matrix to be nearly low-rank, we want 
g to be close to 0 , and hence usually a' = 1 /g — 1/2 > 0 . 
The following regret bound for a nearly low-rank matrix is a 
consequence of Lemma 2. 


Theorem 1 (Matrix recovery; regret bound for nearly low-rank 
matrices). Assume M G is nearly low-rank, Mij > c 

for some positive constant c G (0,1). Then there exists a finite 
set of candidates F, and a regularization function satisfying the 
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Kraft inequality (6) such that the bound 


C>(m )|2 + ^+ min 
, ('^og{did2/m)\ 


d2+f) log 2 d 

~1j 


holds with the same probability as in Lemma 2 for some constant 
Cq > 0 that depends only on q, 

4 = 2TO/[c4^p(di+^ 2 +4)log2d], (23) 

and is defined in (18). Here the expectation is with respect to 
the random Poisson measurements for a fixed realization of A. 

Remark 3. In fact we may compute the minimum in the 
optimization problem inside Theorem 1 in terms of 1. Note 
that the first term is decreasing in £, and the second term 

[\£{di + d 2 + f) log 2 (i]/(2/) is increasing in £, so we may 
readily solve that the minimum in (22) is obtained by 

““ |[A(di+d2+4)log2CiJ ’ *]■ 

If further £^ > {4a'//[A(di + ((2 + 4) log 2 d]}^ ^ which 

is true when the number of measurements m is sufficiently large, 
we may substitute ^min into the expression and simplify the 
upper bound (22) to be 


\{di + ((2 + 4) log 2 (i\ 

4co£'2/ J 


\og{did2/m) 


The implication of Theorem 1 is that, for a nearly low-rank 
matrix whose singular values decay geometrically at a rate of 
j = 1,2,... and the total intensity of the matrix is I, 


reconstruction error 


{di + (( 2 ) log 2 min{di, (i 2}1 log{did 2 /m) 


for m sufficiently large. This also implies that the intensity I 
(or SNR s/1) needs to be sufficiently large for the error bound 
to be controllable by increasing the number of measurements. 


B. Matrix completion 

For matrix completion, we first establish an npper bonnd 
for estimator in ( 11 ), and then present an information theoretic 
lower bonnd which nearly matches the npper bonnd np to a 
logarithmic factor 0 {did 2 ). 

Theorem 2 (Matrix completion; upper bound). Assume M G S, 
LI is chosen at random following our Bernoulli sampling model 
with E[|n|] = m, and M is the solution to (11). Then with a 


probability exceeding (1 — C/{did 2 )), we have 


(a(e^ - 2) + 31og((ii(i2)) 


di+d2V'^ 


(di +d2)log(di(i2) ' 
m 

If m> (di + (( 2 ) ^og{did 2 ), then (24) simplifies to 


8aT A / aylr 

(a(e^ - 2) + 31og(c;ic;2)) • (~~~^ 

Above, C',C are absolute constants and T depends only on a 
and /?. Here the expectation and probability are with respect to 
the random Poisson observations and Bernoulli sampling model. 


-—R{M,M) < s/2C' 



The proof of Theorem 2 is an extension of the ingenions 
arguments for one-bit matrix completion [1]. The extension 
for Poisson case here is nontrivial for varions aforementioned 
reasons (notably the non snb-Ganssian and only locally snb- 
Gaussian nature of the Poisson observations). An outline of our 
proof is as follows. First, we establish an upper bound for the 
Kullback-Leibler (KL) divergence D{M\\X) for any Af € 5 by 
applying Lemma 7 given in the appendix. Second, we hnd an 
npper bonnd for the Hellinger distance d/j{M,M) nsing the 
fact that the KL divergence can be bounded from below by the 
Hellinger distance. Finally, we bound the mean squared error in 
Lemma 8 via the Hellinger distance. 

Remark 4. Fixing m, a and j3, the upper bounds (24) and (25) 
in Theorem 2 increase as the upper bound on the nuclear norm 
increases, which is proportional to s/rd/df. This is consistent 
with the intuition that our method is better at dealing with 
approximately low-rank matrices than with nearly full rank 
matrices. On the other hand, fixing di,d 2 ,a, j3 and r, the upper 
bound decreases as m increases, which is also consistent with 
the intuition that the recovery is more accurately with more 
observations. 

Remark 5. Fixing a, and r, the upper bounds (24) and (25) 
on the mean-square-error per entry can be arbitrarily small, in 
the sense that the they tend to zero as di and d 2 go to infinity and 
the number of the measurements m = 0{{di -f (( 2 ) log‘^(fii(i 2 )) 
(m < did 2 ) for 6 > 2. 

We may obtain an upper bound on the KL divergence (which 
may reflect the trne distribntion error) as a conseqnence of 
Theorem 2. 

Corollary 1 (Upper bound for KL divergence). Assume M G S, 
H is chosen at random following the Bernoulli sampling model 
with E[|H|] = m, and M is the solution to (11). Then with a 
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probability exceeding (1 — C/{did 2 )), 


D{M\\M) <2C' (av^//3) (a(e^ - 

di^y/\ [i + 

m J 


2) + 3 log(fii(i 2 )) • 

(di +d2)log(di(i2)''^^^ 


(26) 


Above, C and C’ are absolute constants. Here the expectation and 
probability are with respect to the random Poisson observations 
and Bernoulli sampling model. 


need to bound max^ Yij when Yij is a Poisson random variable 
with intensity Moreover, the Poisson likelihood function 
is non Lipschitz (due to a bad point when tends to zero), 
and hence we need to introduce a lower bound on each entry of 
the matrix , which can be interpreted as the lowest required 
SNR. Other distinctions also include analysis taking into account 
of the property of the Poisson likelihood function, and using 
KL divergence as well as Hellinger distance that are different 
from those for the Bernoulli random variable as used in [1]. 


The following theorem establishes a lower bound and demon¬ 
strates that there exists an M G S such that any recovery method 
cannot achieve a mean square error per entry less than the order 
ofOi^ niax{di,d2}/m). 


Theorem 3 (Matrix completion; lower bound). Fix a, j3, r, di, 
and d 2 to be such that /?>!,«> 2/3, di > I, d 2 > 1, r > 4, 
and a^rmax{di,d 2 } > Cq. Fix flo be an arbitrary subset of 
|(ii] X |ci2l with cardinality m. Consider any algorithm which, 
for any M G S, returns an estimator M. Then there exists 
M G S such that with probability at least 3/4, 


did2 


rmax{di, ^ 2 } 
m 



(27) 


as long as the right-hand side of (27) exceeds 
Cira^/ TO.\n{di,d 2 }, where Cq, Ci and C 2 are absolute 
constants. Here the probability is with respect to the random 
Poisson observations only. 


Similar to [1], [38], the proof of Theorem 3 relies on 
information theoretic arguments outlined as follows. First we 
find a set of matrices x C 5 so that the distance between any 
identified as is sufficiently 

large. Suppose we obtain measurements of a selected matrix 
in X and recover it using an arbitrary method. Then we could 
determine which element of x was chosen, if the recovered 
matrix is sufficiently close to the original one. However, there 
will be a lower bound on how close the recovered matrix can 
be to the original matrix, since due to Fano’s inequality the 
probability of correctly identifying the chosen matrix is small. 


IV. Algorithms 

In this section we develop efficient algorithms to solve the 
matrix recovery (7) and matrix completion problems (11). In the 
following, we use nuclear norm regularization function for the 
matrix recovery in (7). Then, (7) and (11) are both semidefinite 
program (SDP), as they are nuclear norm minimization problems 
with convex feasible domains. Hence, we may solve it, for 
example, via the interior-point method [39]. Although the interior- 
point method may return an exact solution to ( 11 ), it does not 
scale well with the dimensions of the matrix di and d 2 as the 
complexity of solving SDP is 0(df -f did^ + d\d\). 

We develop two set of algorithms that can solve both 
problems faster than the interior point methods. These algorithms 
including the generic gradient descent based methods, and 
a Penalized Maximum Likelihood Singular Value Threshold 
(PMLSVT) method tailored to our problem. We analyzed the 
performance of the generic methods. Although there is no 
theoretical performance guarantee, PMLSVT is computationally 
preferable under our assumptions. Another possible algorithm 
not cover here is the non-monotone spectral projected-gradient 
method [1], [40]. 

A. Generic methods 

Here we only focus on solving the matrix completion problem 
( 11 ) by proximal-gradient method; the matrix recovery problem 
(7) can be solved similarly as stated at the end of this subsection. 

First, rewrite S in (10) as the intersection of two closed and 
convex sets in 

<a, 

V(i,i) e [fill X |d2l}, 


Remark 6 . Fixing a, (3 and r, the conditions in the statement 
of Theorem 3 can be satisfied if we choose sufficiently large di 
and d 2 . 

Remark 7. When m > (di d 2 ) \og{did 2 ), the ratio between 
the upper bound in (25) and the lower bound in (27) is on the 
order of 0(\og(did2)). Hence, the lower bound matches the 
upper bound up to a logarithmic factor. 


and 

T2 = {XG : ||X||* < as/rdid2}, 

where the first set is a box and the second set is a nuclear norm 
ball. Let f{X) = —Fq^y{X) be the negative log-likelihood 
function. Then optimization problem (11) is equivalent to 

M = arg min f{X). (29) 

xeriflTs 


Our formulation and results for Poisson matrix completion 
are inspired by one-bit matrix completion [ 1 ], yet with several 
important distinctions. In one-bit matrix completion, the value 
of each observation Yij is binary-valued and hence bounded; 
whereas in our problem, each observation is a Poisson random 
variable which is unbounded and, hence, the arguments involve 
bounding measurements have to be changed. In particular, we 


Noticing that the search space S = rip|r 2 is closed and 
convex and f(X) is a convex function, we can use proximal 
gradient methods to solve (29). Let Ir(2f) be an extended 
function that takes value zero if AT e F and value 00 if AT ^ F. 
Then problem (29) is equivalent to 

M = arg min /(W)-hIrjPir 2 ( 2 f). (30) 
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To guarantee the convergence of proximal gradient method, we 
need the Lipschitz constant L > 0, which satisfies 

\\vf{u)-vfiv)\\F<L\\u-v\\F, yu,ves, od 

and hence L = a//3'^ by the definition of our problem. Define 
the orthogonal projection of a matrix X onto a convex set F as 

np(X) = argmin \\Z — X\\‘^p. 
zer 

1) Proximal gradient: Initialize the algorithm by = ^ij 

for {i,j) G D and = (a + /3)/2 otherwise. Then iterate 

using 

Xk = IlsiXk-i-{l/L)Vf{Xk-i)). (32) 


This algorithm has a linear convergence rate: 


Lemma 3 (Convergence of proximal gradient). Let {X^} be 
the sequence generated by (32). Then for any k > 1, we have 


f(Xu) - f{M) < 


L\\X^-M\\l 

2k 


2) Accelerated proximal gradient: Although proximal gradi¬ 
ent can be implemented easily, it converges slowly when the 
Lipschitz constant L is large. In such scenarios, we may use 
Nesterov’s accelerated method [41]. With the same initialization 
as above, we perform the following two projections at the fcth 
iteration: 

Xfc = ns(Zu-i - (l/L)V/(Zfe_i)), 

Zk=Xk + {(k-l)/(k + 2))(Xk-Xk-i). ^ ’ 

Nesterov’s accelerated method converges faster: 


Lemma 4 (Convergence of accelerated proximal gradient). Let 
{Xfe} be the sequence generated by (33). Then for any k > 1, 
we have _ 


f{Xk) - f(M) < 


2L\\XQ-Mfp 
(fc + l)2 


3) Alternating projection: To use the above two methods, 
we need to specify ways to perform projection onto the space 
S. Since S is an intersection of two convex sets, we may use 
alternating projection to compute a sequence that converges to the 
intersection of Fi and F 2 . Let Uq be the matrix to be projected 
onto S. Specifically, the following two steps are performed 
at the jth iteration: Vj = nr 2 ([/j_i) and Uj = nri(V,), 
until \\Vj — Uj\\F is less than a user-specified error tolerance. 
Alternating projection is efficient if there exist some closed 
forms for projection onto the convex sets. Projection onto the 
box constraint Fi is quite simple: [nri(X)]ij assumes value 
(3 if Yij < P and assumes value a if Yij > a, and otherwise 
maintains the same value Yij if /3 < Yij < a. Projection onto 
F 2 , the nuclear norm ball, can be achieved by projecting the 
vector of singular values onto a £i norm ball via scaling [13] 
[42]. 

4) Algorithm for matrix recovery: To solve (7), similarly, 
we can assume M is approximately low-rank with a bounded 
nuclear norm and each entry of M is bounded by /3 = c/m and 


a = / in (28) based on the assumptions of Theorem 1. Similar 
steps can be applied hereafter if we consider an additional convex 
set 

Fo ^ {M e : IIA^IIi.i = I, [M]jk > 0,VjA:}. (34) 


B. Penalized maximum likelihood singular value threshold 
(PMLSVT) 

We also develop an algorithm, referred to as PMLSVT, which 
is tailored to solving our Poisson problems. PMLSVT differs 
from the classical projected gradient in that instead of computing 
the exact gradient, it approximates the cost function by expanding 
it using a Taylor expansion up to the second order. The resulted 
approximate problem with a nuclear norm regularization term 
has a simple closed form solution using Theorem 2.1 in [13]. 
Therefore, PMLSVT does not perform gradient descent directly, 
but it has a simple form and good numerical accuracy as verified 
by numerical examples. 

The algorifhm is similar fo fhe fasl iferafive shrinkage¬ 
thresholding algorithm (FISTA) [43] and its extension to matrix 
case with Frobenius error [31]. Similar to the construction in 
[20] and [32], using Aq and Ai as regularizing parameters and 
the convex sets Fq and Fi defined earlier in (34) and (28), we 
may rewrite (7) and (11) as 

M = argmin/i(X) -I- Ai||X||*, i = 0,1, (35) 

xeTi 

respectively, where fo(X) = - 1 {j/i log[AlX]j - [AX]i} 

and/i(X) = -Fn,y(X). 

The PMLSVT algorithm can be derived as follows (similar to 
[31]). For simplicity, we denote f{X) for the fo(X) or fi(X). 
In the fcth iteration, we may form a Taylor expansion of f{X) 
around X^-i while keeping up to second term and then solve 

Xk = a.Tguim[Qt^{X,Xk-i) + A||X||^], (36) 

with 


Qt, (X, Xk-i) = f{Xk-i) + {X- Xk-i,Xf{Xk-i)) 

+ |||X-Xfc_i|||, (37) 

where V/ is the gradient of /, tk is the reciprocal of the step 
size at the fcth iteration, which we will specify later. By dropping 
and introducing terms independent of M whenever needed (more 
details can be found in [44]), (36) is equivalent to 


Xk = 


arg min 
X 


1 

/ 1 . 

^ A „ „ 

2 

X - -V/(Xfc_i)j 

+ ^ ♦ 


Recall d = min{fii,d 2 }. For a matrix Z G jgj 

singular value decomposition he Z = UYAY^, where U € 

V G S = diag{[(Ti,..., and ai is a singular value 

of the matrix Z. For each r > 0, define fhe singular value 
thresholding operator as: 


Dr{Z) 4 [/diag{[(ai - t)+, ..., - t)+]^}VY (38) 


The following lemma is proved in [13]: 



Lemma 5 (Theorem 2.1 in [13]). For each r > 0, and Z € 

Xd2 . 

Dr{Z)= argmin {]^\\X - Z\W + t\\Z\\A . (39) 

XgR£ilX£i2 L xl J 

Due to Lemma 5, the exact solution to (38) is given by 

Xk = (^Xk-i - 2v/(Xfc_i)^ . (40) 

The PMLSVT algorithm is summarized in Algorithm 1. The 
initialization of matrix recovery problem is suggested by [45] 
and that of matrix completion problem is to choose an arbitrary 
element in the set Ti. For a matrix Z, define a projection of Z 
onto Fq as follows: 

where entry of (Z)+ is (Zij)^. In the algorithm 

description, t is the reciprocal of the step size, 77 > 1 is a 
scale parameter to change the step size, and K is the maximum 
number of iterations, which is user specified: a larger K leads to 
more accurate solution, and a small K obtains the coarse solution 
quickly. If the cost function value does not decrease, the step size 
is shortened to change the singular values more conservatively. 
The algorithm terminates when the absolute difference in the 
cost function values between two consecutive iterations is less 
than 0.5/iT. Convergence of the PMLSVT algorithm cannot be 
readily established; however. Lemma 3 and Lemma 4 above 
may shed some light on this. 

Algorithm 1 PMLSVT for Poisson matrix recovery and com¬ 
pletion 

1: Initialize: The maximum number of iterations K, parameters 
a, P, ij, and t. 

X ^ 1 ViAi) {matrix recovery} 

[X]ij ^ Yij for (i, j) G O and [X]ij ^ {a+P)/2 otherwise 
{matrix completion} 

2: for A: = 1,2,... AT do 
3: C^X-{l/t)Xf{X) 

4: C = C/SV' {singular value decomposition} 

5: [Ljii •<— ([Sjii — X/t)^, i = 1,... ,d 

6: X' X {record previous step} 

7: X -^V (C/EVT) {matrix recovery} 

X ^ Ilri (C/SV'j {matrix completion} 

8: If j(X) > Qt(X,X') then t ^ -qt, go to 4. 

9: If |/(A) - Qt{X,X')\ < 0.5/A: then exit; 

10: end for 


Remark 8. At each iteration, the complexity of PMLSVT 
(Algorithm 1) is on the order of 0{d\d2 + d^) (which comes 
from performing singular value decomposition). This is much 
lower than the complexity of solving an SDP, which is on the 
order of 0(d\ Pdid^ + d\d\). In particular, for a d-by-d matrix, 
PMLSVT algorithm has a complexity 0(df'), which is lower 
than the complexity 0{d^) of solving an SDP. One may also 
use an approximate SVD method [46] and a better choice for 


step sizes [43] to accelerate PMLSVT. 

V. Numerical examples 

We use our PMLSVT algorithm in all the examples below. 

A. Synthetic data based on solar flare image 

1) Matrix recovery: In this section, we demonstrate the 
performance of PMLSVT on synthetic data based on a real 
solar flare image captured by the NASA SDO satellite (see 
[47] for detailed explanations). We use this original solar flare 
image to form a ground truth matrix M, and then generate 
Poisson observations of M as described in (1). All the numerical 
examples are run on a laptop with 2.40Hz dual-core CPU and 
8GB RAM. 

The solar flare image is of size 48-by-48 and is shown in 
Fig. 1(a). To generate M, we break the image into 8-by-8 
patches, vectorize the patches, and collect the resulted vectors 
into a 64-by-36 matrix Mq (di = 64 and d 2 = 36). Such a 
matrix can be well approximated by a low-rank matrix M, as 
demonstrated in Fig. 1(b). It is a image formed by M, when M 
is a rank 10 approximation of Mq. Note that visually the rank-10 
approximation is very close to the original image. Below we 
use this rank-10 approximation M as the ground truth matrix. 
The intensity of the image in is / = ||M|| = 3.27 x 10®. 

To vary SNR, we scale the image intensity by p > 1 (i.e., the 
matrix to be recovered M = pM and M is the original matrix). 
Hence, SNR of Poisson observations for the scaled image is 
proportional to s/pl. 

Below, we use the PMLSVT algorithm for recovery and the 
parameters are t = 10“®, q = 1.1, and K = 2500. Fig. 1(c) 
and Fig. 1(d) contain the recovered image for a fixed number of 
measurements m = 1000 and A = 0.002, when p = 2 and p = 7 
respectively. Note that the higher the p (and hence the higher 
the SNR) the less error in the recovered image. In Fig. 2, p 
increases from 1 to 9 and the normalized squared error decreases 
as p increases (although the improvement is incremental after p 
is greater than 3). 

We further compare the quality of the recovered matrix 
using the PMLSVT algorithm (which approximately solves 
the maximum likelihood problem), with the recovered matrix 
obtained by solving the maximum likelihood problem exactly 
via semi definite program (SDP) using CVX^. Solving via SDP 
requires a much higher complexity, as explained in Remark 8. 
Below, we fix p = 4, m = 1000, A = 0.002, and increase the 
number of measurements m while comparing the normalized 
square errors of these two approaches. Fig. 3 demonstrates that 
PMLSVT, though less accurate, has a performance very close 
to the exact solution via SDP. The increase in the normalized 
error of PMLSVT algorithm relative to that of SDP is at most 
4.89%. Also, the normalized errors of both approaches decrease 
as m increases. PMLSVT is a lot faster than solving SDP by 
CVX, especially when m is large, as shown in Table 1. 

Fig. 4 demonstrates the normalized error with different values 
of A, when m = 1000 and p = 4. Note that there is an optimal 
value for A with the smallest error (thus our choice for A = 0.002 
in the above examples). 

^http://cvxr.com/cvx/ 
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Fig. 1: Original, low-rank approximation to solar flare image, and 
recovered images from compressive measurements when the intensity 
of the underlying signal is scaled hy p (SNR is on the order or Ijyfpl). 
The parameters for (c) and (d) are m = 1000 and A = 0.002. 



Fig. 2: Matrix recovery from compressive measurements: normalized 
error R{M, M)/l‘^ versus p, when m = 1000 and A = 0.002, using 
PMLSVT. 


2) Matrix completion: We demonstrate the good performance 
of the PMLSVT algorithm for matrix completion on the same 
solar flare image as in the previous section. Set a = 200 and 
/3 = 1 in this case. Suppose the entries are sampled via a 
Bernoulli model such that E[|n|] = m. Set p = m/{did 2 ) in 
the sampling model. Set t = 10“^ and 77 = 1.1 for PMLSVT. 
Fig. 5 shows the results when roughly 80%, 50% and 30% of 
the matrix entries are observed. Even when about 50% of the 
entries are missing, the recovery results is fairly good. When 
there are only about 30% of the entries are observed, PMLSVT 
still may recover the main features in the image. PMLSVT is 
quite fast: the run times for all three examples are less than 1.2 
seconds. 

B. Bike sharing count data 

To demonstrate the performance of our algorithm on real 
data, we consider the bike sharing data set^, which consists of 
17379 bike sharing counts aggregated on hourly basis between 

^The data can be downloaded at 
http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset [48]. 


X 10“° 



N 

Fig. 3: Matrix recovery from compressive measurements: normalized 
error R{M, versus the number of measurements m when p = 4 

and A = 0.002, for solutions obtained using CVX and PMLSVT, 
respectively. 


TABLE I: CPU mn time (in seconds) of solving SDP by using CVX 
and of the PMLSVT algorithms, with p — A and A = 0.002 and m 
measurements. 


m 

500 

750 

1000 

1250 

1500 

SDP 

725 

1146 

1510 

2059 

2769 

PMLSVT 

172 

232 

378 

490 

642 


the years 2011 and 2012 in Capital bike share system with the 
corresponding weather and seasonal information. We collect 
countings of 24 hours over 105 Saturdays into a 24-by-105 
matrix M {di = 24 and ^2 = 105). The resulted matrix is 
nearly low-rank. Assuming that only a fraction of the entries of 
this matrix are known (each entry is observed with probability 
0.5 and, hence, roughly half of the entries are observed), and 
that the counting numbers follow Poisson distributions with 
unknown intensities. We aim at recover the unknown intensities, 
i.e.. Ailing the missing data and performing denoising. We use 
PMLSVT with the following parameters: a = 1000, /3 = 1, 
t = 10~^, 1 ] = 1.1, K = 4000 and A = 100. In this case there 
is no “ground truth” for the intensities, and it is hard to measure 
the accuracy of recovered matrix. Instead, we are interested 
in identifying interesting patterns in the recovered results. As 
shown in Fig. 6(b), there are two clear increases in the counting 
numbers after the 17th and the 63th Saturday, which may not be 
easily identified from the original data in Fig. 6(a) with missing 
data and Poisson randomness. 


VI. Conclusions 

In this paper, we have studied matrix recovery and completion 
problem when the data are Poisson random counts. We con¬ 
sidered a maximum likelihood formulation with constrained 
nuclear norm of the matrix and entries of the matrix, and 
presented upper and lower bounds for the proposed estimators. 
We also developed a set of new algorithms, and in particular the 
efficient the Poisson noise Maximal Likelihood Singular Value 
Thresholding (PMLSV) algorithm. We have demonstrated its 
accuracy and efficiency compared with the semi-definite program 
(SDP) and tested on real data examples of solar flare images 
and bike sharing data. 
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Fig. 4: Matrix recovery from compressive measurements: normalized 
error R{M, M)/1’^ versus A when fixing m = 1000 and p = 4, using 
PMLSVT. 


(a) original data, p — 0.5. (b) A = 100, K = 4000. 

Fig. 6: Bike sharing count data: (a): observed matrix M with 50% 
missing entries; (b): recovered matrix with A = 100 and 4000 iterations, 
with an elapsed time of 3.147153 seconds. 






(a) p = 0.8. 



(e) p = 0.3. 



(b) A = 0.1, K = 2000. 



(d) A = 0.1, K = 2000. 



(f) A = 0.1, it: = 2000. 


Fig. 5: Matrix completion from partial observations: (a), (c), and (e): 
80%, 50% and 30% of entries observed (dark spots represent missing 
entries); (b), (d), and (f): images formed by complete matrix with 
A = 0.1 and no more than 2000 iterations, and the run times of the 
PMLSVT algorithm are 1.176595, 1.110226 and 1.097281 seconds, 
respectively. 
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Appendix A 

Proofs for matrix recovery 


Proof of Lemma 1: Given matrices Xi and X 2 , we can 
define Ui = vec{Xi), for i = 1, 2. Define the matrix 


A = 


vec(Ai)''' 

yec(A^y^ 


nmx (did2) 


then we have AXi = Aui for z = 1,2. Note that the matrix 
A can be written as A = where the entries of Z are 

drawn i.i.d. to take value (1 — p)/p with probability p or to 
take value y/pl{l — p) with probability I—p. Also, by noticing 
the correspondence between vector and matrix norms, we have 
IjXi —X2\\f = ||m1 —U2II2, \\AXi —AX2\\2 = \\Aui — ^^ 2112 - 
Moreover, since ||Xj||i4 = ||zzi||i and \\Xi\\F = ||ui||2, for 
z = 1,2, the fi-ball and the sphere defined 

for matrix of dimension (ii-by-(i2 can be translated to fi-ball 
and sphere for the corresponding vector space of dimension 
did2. Hence, all conditions of Theorem 1 [24] are satisfied, and 
we may apply it to a signal vector space of dimension did2 
and a matrix operator of dimension m-hy-{did2) to obtain the 
statement in Lemma 1. 


Proof of Lemma 2: Assume M and M are the tme matrix 
and its estimator, respectively. Let u = vec(M) and u = vec(M). 
Again, by making the links that AM = Au, AM = Au, 
\\AiM - M )\\2 = \\Aiu - u)\\ 2 , ||M||i,i = ||M||ia = / is 
eqnivalently ||t6||i = ||zz||i = I, as well as the RIP for the 
matrix operator Lemma 1, we may directly apply Theorem 2 
in [24] to the case of a vector of dimension did 2 and a matrix 
operator of dimension m-hy-did 2 to obtain the desired result. 

■ 

The proof of Theorem 1 requires the following lemma: 


Lemma 6 (Covering nnmber for low-rank matrices, Lemma 
4.3.1. in [29]). Let Sr = {AT € : rank{X) < r, ||X||f = 

1}, then there exists an e-net Sr C Sr, with respect to Frobenius 
norm, i.e., for any V € Sr, there exists Vq € Sr, such that 
||Vo — V\\f< £, and 


< 


/g\ {di+d2-\-l)r 

U/ 


that 


Ii0^"^l!2<||0|!2 = 


= .MxTx) 


( 42 ) 


\J j k j k 


Using the parameterization in (41), we can code X^^') using three 
steps by encoding the “magnitnde” ||0*'^^||2, the scaled rank-f 
matrices X^, and the valne £ of the rank itself. (1) Qnantize 
1111 into one of Vd bins that nniformly divide the interval 
[—/,/]. Let the result of the quantization to be Vg. Since there 
are s/d bins, encoding Vg reqnires f log2 d bits. (2) Qnantize 
Since ||xW||f = 1, using Lemma 6, we can form a e-net 
Sg such that for every f there is a corresponding xjf'^ € Sg 
with \\X^^'>-X‘i\\F < 9/v^ and \Sg\ = dlrfi+rfs+LUa. Hence, 
to encode the elements in Sg, we need f (di 4-^2 + l)f log2 d 
bits. (3) Finally, encode £, the rank of X^^f Since the rank of 
these matrices are at most, we need log2 d bits. Let 

X^g^^^VgXj^^/ (43) 


It can be verified that the above quantization scheme results in a 
set of approximations for X^^'> and a corresponding prefix code 
for X^^K From the three steps above, the average code length 
for X^^'> is upper bounded by | log2 d-l- ^(di -I- ^2 + l)^log2 d 
bits. Finally, to ensnre total intensity constraint and that each 
element of X is greater than c, we project Xg ' onto a set 

C = {X & : Xjk > c and ||A:||fi = /.} 

and use Vc {x'ip ) as a candidate estimator in F, where Vc is a 
projector operator onto the set C, i.e.. 


Ve{X) = argminljX' - X||f- 
X'ec 

Using the construction above for F, the complexity of X € F 
satisfies 

3 1 1 

pen(X) < -log 2 d+-{di+d 2 +l)£log 2 d < -(di-|-d2-l-4)f'log2 d. 


Now, given X = U*dmg{9*} V*'^, and let be its best 
rank-£ approximation, Xg^^ be the qnantized version of X^^\ 
for which we have 


Proof of Theorem 1: The proof of Theorem 1 involves 
constrncting a snitable set of estimators Fq and Q for Lemma 
2, estimate the sizes of the set, set the regnlarization fnnction 
pen(2f) such that it encourages low-rank M and satisfies Kraft 
ineqnality, and then invoking Lemma 2. Given G F, we 
introduce its scaled version 

= C/diag{6»(^Vll^''^^ll2}HT, 

so that = 1 and 

(41) 

Since all X G F satisfies ||Ar|ji_i = / and Xjk > 0, we have 


(44) 

= ||||d(^)||2X(^) (45) 

< 2|i||d(^)||2X(^)-r,X(^)|||. + 2||r,xW-r,xW||| (46) 

= 2(||d«||2-r,)2||xWf^ + 2r2||^(r) _^(r)||^ ^47^ 

r2 fii ifi/ir2 


where the last ineqnality follows from |||d*^^^||2 — 'kq\ < 
2II(2Ad), ||xW||f = 1, \rg\ < I, and ||XW - < 

9/s/d. Hence, using above, we can bound the distance between 
an arbitrary matrix X and its candidate estimator Vc^Xg^^) G F, 
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\\X-Vc{Xf))\\l 

= \\X - X^^^) + X^^) -Vc{x[^'>)\\l 

< 2||X - XW|||, + 2||XW - Vc{x[^^)\\l 

< 2\\X - X^^^ + X^^^ III+ 8/2 

< 4||/s:-X(^)||| + 4||x(^) -XW||| + 8/2 

< 4/2(coeV2(i/«-i/2) + 164// + 2), 

for some constant cq > 0 that depends only on q, where the 
second step is due to triangle inequality, 

||XW )fF < 2||xW||2, + 2||/^c(^i'))||?. 

< 2r2 + 2/2 < 4/2. 

Since Ve{X^^) e C, so ||'Pc(-^9^^)||i,i = I- Then dne to an 
argument similar to the last inequality in (42), 

\\Vdxf^)\\F<\\Ve{xf^)Ki = I. 

Above, we have nsed the ineqnality ||A + F| II < 2||A||| + 
2||F|||. The last ineqnality nses (48) and onr geometrically 
decay singular value assumption and (21). 

Given each 1 < / < /, let T^ be a set snch that the the 
elements in T^ are rank-/ and after projection onto set C they 
belong to T. Then from the estimate above loglT^I < ^{di + 
d 2 + 4)/log2 d. Therefore, Fe G G, whenever 

i(/i +/2 + 4)/log2/ < m/{c4^p), 


1 < / < 


Ci^ddi +/2 +4) log2/’ 


where the second inequality is because Cm,p is 0{m). ■ 

Appendix B 

Proofs for matrix completion 

To begin, we hrst recall some dehnitions from introduction and 
explain some additional notation that we will need for the proofs. 
For two probability distribntions V and Q on a conntable set 


A, D{V\\Q) will denote the Kullback-Leibler (KL) divergence 

4^(^iis) = Ewiog(m), 

where V{x) denotes the probability of the outcome x under the 
distribntion V. In the following, we will abnse this notation 
slightly, to mean the KL divergence between two Poisson 
distribntions with different parameters (the argnments in the 
notations denote parameters of the Poisson distribntions), in 
the following two ways. First, for scalar inpnts p,q G K+, 
we will set D{p\\q) = p\og{p/q) — {p — q), which gives the 
KL divergence between two Poisson probability distribntions. 
Second, we allow the KL divergence to act on matrices via the 
average KL divergence over their entries: for two matrices P, 
Q G we dehne 


D{P\\Q)^—Y,D{Pd\Qd- 

i,3 

For two probability distribntions V and Q on a conntable set 
A, d^fj{V, Q) will denote the Hellinger distance 


4(^,Q) = E(^ 


(x) - VQ(x) 


and we may invoke Lemma 2. The term 164// is independent 
of /, so we bring it out from the maximization with respect to £. 
Finally, under this condition, using the statement of Lemma 2, for 
a nearly low-rank matrix M, suppose its estimator constructed 
as above is given by we have 

lE[i?(M,M)] < C'™.j,^mm jl||M-/^c(MW)||| 

Apen(M^^^) 2c|^^log(c2^p/i/2/m) 

I ^ d 

< 0(m)(164//+ 2 + min 

A(/i + /2 + 4)/log2 /,. 
if 

/log(/i/ 2 /m)A 


Similarly, we abuse this notation slightly to denote the Hellinger 
distance between two Poisson distributions with different param¬ 
eters (the argnments in the notation denote parameters of the 
Poisson distributions). We use the Hellinger distance between two 
Poisson distributions, which, for two scalars p,q G K+, is given 
by, d'jjip, q) = 2 — 2exp | —4 (^/p — ■ For matrices P, 

Q G the average Hellinger distance is dehned by 

djiiP, Q) — -r~j~ E ^‘niPijTQij)- 


A. Proof of Theorem 2 

To prove Theorem 2, the key will be to establish the 
concentration ineqnality (Lemma 7) and the lower bonnd for 
the average Hellinger distance (Lemma 8). 

Lemma 7. Let Fq^y{X) be the likelihood function defined in 
(9) and S be the set defined in (10), then 


P<^ sup |Fo+(^)-E[Fa+W]| 
lxe5 

> C (as/r/fi) {aie^ - 2) + 31og(/i/2)) • (49) 

V"r(/i + /2) + /1/2 log(/i/2)) I < 

where C and C are absolute positive constants and the 
probability and the expectation are both over the choice of 
FI and the draw of Y. 

Lemma 8. For any two matrices P,Q G S, we have 

AolI d\d2 

where T = d{oi — /3)2. 
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We will prove Lemma 7 and Lemma 8 below, but first we 
use them in proving Theorem 2. 

Proof of Theorem 2: To begin, notice that for any choice 
of X &S, 


E [Fn^riX) — Th,y(M)] 


m 

did2 


E 


Mij log 



m 

did2 


E 


Mij log 


/ Mjj 


m 

did2 


id 


{Xij Mij ) 

— {Mij — Xij) 

mD{M\\X), 


(50) 


where the expectation is over both O and Y. 

On the other hand, note that by assumption the true matrix 
M G S. Then for any Z G S, consider the difference below 


Fa,Y{Z) — Fn.yiM) 

= -Fh,y(^) + (Z)] — E[Fo_y(Z)] 

+ E[Fo^y(M)] —E[Fay{M)] — Fay{M) 

= "^[FnyiZ)] — E[T'n^y(M)] + 

Fny{Z) — E[_Fh,y(-^)] + E[Fn,y(M)] — F^y{M) 

< E \Fa,Y{Z) — Fiiy{M)] + 

\Fo,y{Z) — E[Fo_y(Z)]| + \F^y{M) — E[Fo^y(M)]| 

< - mD{M\\Z) + 2 sup |Fn,y(X) - E[Fo,y(X)]|, (51) 

xgs 

where the second equality is to rearrange terms, the first 
inequality is due to triangle inequality, the last inequality is 
due to (50) and the fact that 


|i^n.y(^) - E[Fn.y(Z)]| < sup \FnyiX) - E[i^a,y(^)]| 


and 


\Fq,y{M) — E[Fo^y(M)]| < sup |Fn,y(2f) — E[Fo_y(X)]|. 

xes 

Moreover, from the definition of M, we also have that M G S 
and Fq y{M) > Fq y(M). Thus, by substituting M for Z in 
(51), we obtain 

0 < -mD{M\\M) + 2 sup \Fny{X) - E[Fa,y(W)]|. 
xes 


To bound the second term in the above expression, we apply 
Lemma 7, and obtain that with probability at least 1 — C/{did 2 ), 
we have 


0 < —mD{M\\M) + 2C' {a^/r/fd) (Q;(e^ ~ 2) + Slog{did 2 )) 
(^y/m{di + d2) + did2 log(difi2)) • 


After rearranging terms, and use the fact that \/did 2 < di + d 2 , 
we obtain 


D{M\\M) <2C' (av^//?) {a{e^ - 2) + Mog{did 2 )) ■ 

( I di + d2 L ^ {di + d2)\og{did2) \ (52) 

I V m V m j 


Note that the KL divergence can be bounded below by the 


Hellinger distance (Chapter 3 in [49]). Using our notation to 
denote the parameters of the Poisson distributions in the argument 
of the distance, we have 

d'\j{p,q) < D{p\\q), (53) 

for any two scalars p,q G M+ that denote the parameters of the 
Poisson distributions. Thus, (52) together with (53) lead to 

djf{M,M) < 2C' {a^lp) {a{e^ - 2) + 31og(did2)) • 

^di+d2 ^^ ^ (di+d2)log(c(id2) ^ 

(54) 

Finally, Theorem 2 follows immediately from Lemma 8. 


Next, we will establish a tail bound for Poisson distribution 
with the method of establishing Chemoff bounds. And this result 
will be used for proving Lemma 7. 


Lemma 9 (Tail bound for Poisson). For Y ~ Poisson{\) with 
\< a, P(y —A >t)< e~*, for all t>to where to = a(e^ —3). 


Proof of Lemma 9: The proof below is a specialized version 
of Chernoff bound for Poisson random variable [50] when A is 
upper bounded by a constant. For any 0 > 0, we have 

p(r-A>f) = p(y >f + A) 

= P {6Y > 6 {t + X)) =V (exp {9Y) > exp {9 {t + A))) 

< exp {—9 {t + A)) E (e^^) = exp(—0(A + t)) ■ exp (A(e^ — 1)) , 

where we have used Markov’s inequality and the moment 
generating function for Poisson random variable above, mow let 
0 = 2, we have 

exp(f) • P (U — A > f) < exp (—f + A(e^ — 3)) . 

Given to = a.{e^ — 3), then for all t > to, we have exp(f) • 

P (F — A > f) < 1. It follows that P (F — A > f) < when 
t>to> A(e2 - 3). ■ 


Proof of Lemma 7: 


We begin by noting that for any h, > 0, by using Markov’s 
inequality we have that 
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< E 


sup |Fa^y(X) — E[^h,y(-’^)]| 
xgs 

> C {ay/r/0) {a{e^ - 2) + 3log((ii(i 2 )) • 

+ d2) + did2 log(di(i2)) | 

sup |Fn,y(X) — E[Fn,y(X)] 
xes 

> {C {ay/r/j3) (a(e^ - 2) + 3log{did2)) ■ (55) 

(^\/m{di + d2) + did2 \og{did2) 

sup \Fn,YiX) - E[Fn,YiX)]\'^ 


and ||Z||* < a\/rdid 2 + \/did 2 by triangle inequality. Therefore, 
4>{Zij) is a contraction and it vanishes at 0. We obtain that 

I h~\ 


22ft-lg 


< 22'*"1E 


sup 

Xg5 


E 


sup 

x^s 


e n]}{Y,j{~\ogX,j)) 

max YJ] 

i,3 ^ 

^eyl{[(i,j) e n]}((-logXy)) 


X&s 


/ 


= 22'*-1E ImaxK'; 


[ hi 


{{o' {a^/r/jd) {a{e^ - 2) + 31og(did2)) • 
m{di +d2) + did2 \og{did2)^ ) }. 

The bound in (49) will follow by combining this with an upper 


E 


sup 

xes 


^eyl{[(i, j) e n]} ( 




bound on E 


h = \og{did2). 


snpxes\Fn,Y{X) - E[Fn,YiX)]\' 


and setting 


< 2 


2h-l 


/ 0 \ ^ 
(-) E 

max Y/^ 

V/3/ 

. . 11 


Let eijS be i.i.d. Rademacher random variables. In the 
following derivation, the first inequality is due the Radamacher 
symmetrization argument (Lemma 6.3 in [51]) and the second 
inequality is due to the power mean inequality: {a + b)^ < 
2^-1 _|_ jjh'^ if a, 6 > 0 and h > 1. Then we have 


E 




h 

sup 

^eyl{[(z,j) G ^]}Zij) 






= 2 


2h-l 


/ 0 \ ^ 

(-) E 

max 

E 

V/3/ 

. . 11 
1.3 \ 



sup |(Ao oE,Z)\ 
xes 


E 


sup \Eq^y{X) — ELn,y(-’^)| 
xgs 


(57) 


< 2'‘E 




h 

sup 



x^s 




< 2^E 

2h-i 

( 

sup 

'^eijl{[{i,j) G U]}(p(-logAij)) 

1 




i.3 

) 


+ 2^~^ I sup 
xes 


= 22'*"1E 


sup 

xes 


^eijl{[(z, j) G fI]}(yy(-logXij)) 


where E denotes the matrix with entries given by Cij, Aq 
denotes the indicator matrix for O and o denotes the Hadamard 
product. 

The dual norm of spectral norm is nuclear norm. Using 
the Holder’s inequality for Schatten norms in [52], which is, 

\{A,B)\ < Pillion*, we have 


22ft-iE 




h~ 

sup 

^eyl{[(z,j) G H]}(p(-logAy)) 


X^S 

ij' 



< 22'*-! ( ^ 1 E 

h 


max YJ] 

E 

. . 11 
1.3 \ 



sup llUoAorpli: 
xgs 






h~ 

sup 

^eyl{[(i,j) G U]}Ay 


Xg5 

i.3 



E[\\EoAnt], 


max Y^ 


(58) 


Similarly, the second term of (56) can be bounded as follows: 


(56) 


where the expectation are over both U and Y. 

To bound the first term of (56) with the assumption that 
IIAII* < ay^rdid 2 , we use a contraction principle (Theorem 
4.12 in [51]). We let (pit) = —/?log(f + 1). We know (/)(0) = 0 
and \(j) (f)| = \P/{t + 1)|, so {p (f)| < 1 if f > /3 — 1. Setting 
Z = A-ldixdsUhen wehave Zy > ;3-l,V(z,j) G |fii]x|fi2l 




h 

sup 

^eyl{[(i,j) G U]}Ay 


Xg5 

i.3 



22?i-1e 


<22'‘-1e sup poAof ||A||^ 
.xes 

_ h 

< 


( 59 ) 


22'^-! (\/^) E [||U o Aof; . 


16 



Plugging (58) and (59) into (56), we have 

E sup |Fa,y(X)-EFo,y(X)|' 
xgs 

< 22^-1 (av^+l)'*(y^)''E[||£;oAof] 

r ,1 

+ 1 


|l E 


max 


< Co 


(2(1+ V6))" U 


I m{di + d 2 ) + did 2 \og{did 2 ) 


di d- 


1«2 


E 


max K, 


~,2h-l / „h 


maxlFy - Mij\ 


< 22'*-! ( a’^ + (to)^ + [ P (max|r, 

V d(to)'* V 

< 22'*-! + {tof + P (^max W^j > dt 


y - Mij I >t]dt 


< 22'*-! a^ + {toT+E 


max Wjl 


variables, 


E 


max W., 


< 2/i! + log'*(did2)- 


(64) 


(60) 


Thus, we have 


E 


max Yj 






< 


^2h-l ^ ^ 2/i! + log^(did2)) . 


(65) 


To bound E [||ii^ o Ao||^], we use the very first inequality on 
Page 215 of [1]: 

E[||SoAnf] 

h 


Therefore, combining (65) and (60), we have 


E 


sup |Fn,y(2f) — E[_Fh,y(-^)]| 
xes 

h 


< 24^-1 (av^+l)'‘(y^) E[||T;oAof] 


( 66 ) 


for some constant Co- Therefore, the only term we need to 
bound is E [max^ j- YC\ . 

From Lemma 9, if f > to, then for any {i,j) G [di] X Ic^2l, 
the following inequality holds since to > a: 

P(lY,j-M,,l>t) 

= E (Yij > Mij + f) + P (Yij < Mij — t) (61) 
< exp(—f) + 0 = P(Wij > t), 

where WijS are independent standard exponential random 
variables. Because \Yij — Mij\s and Wij’s are all non-negative 
random variables and max(xi, a;2, ■ • ■, a:„) is an increasing 
function defined on M”, we have, for any d > 1, 

P ( maxl^ij — Mij\^ > f ) < P(maxFFi^ > t), (62) 

V / i,j 

for any t > (to)^- 

Below we use the fact that for any positive random variable 
q, we can write E[g] = P(g > t)dt, and then 


Then, 


E 


sup |Fa^y(A) — E[T'a^y(A)]| 
xes 


< 16(av^+l) (\/^)E[||T;oAof; 


- ] (a-\-to-\-2hlog{did2)) 


hi h 


< 16 (-j {axC + 1) E [P o Ant] " ' 

(Q;(e^ - 2) + 31og(did2)) 

< 128 (1 + V6 ) (a(e2 - 2) + 31og(did2)) 


. ° V /? 

(v"m(di + d2) + did2 log(did2)) . 


(67) 


where we use the fact that (a^+5^+c^+d^)^/^ < a+6+c+dif 
a,b,c,d > 0 in the first inequality and we take h = log(did2) > 
1 in the second and the third inequality. 

Plugging this into (55), we obtain that the probability in (55) 
is upper bounded by 


Cr 


^28(l + V6)V"''^'^^ ^0 


V 


C' 


< 


did2 


(63) 


provided that C > 128 (l + v^) e, which establishes this 
lemma. 


Proof of Lemma 8: Assuming x is any entry in P and y 
is any entry in Q, then P < x,y < a and 0 < \x — y\ < a — p. 
By the mean value theorem there exists an ^{x, y) G [P, a] such 
that 


Above, firstly we use triangle inequality and power mean 
inequality, then along with independence, we use (62) in the third 
inequality. By standard computations for exponential random 


1 


-{x-yf<T. 


The function f{z) = 1 — e~^ is concave in [0,+oo], so if 
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z € [0, T], we may bound it from below with a linear function 


1 - e-^ > (I 

Plugging 2; = ^{y/x- yTyf = - yf into (68), 

have 


2 - 2exp (^-^iVx - > 

1 ~ 1 , >2 

>- (x — yr. 

T 4a^ 


1 — e ^ 1 

T 45(x,2/) 


(a; - yf 


Note that (69) holds for any x and y. This concludes the proof. 


B. Proof of Theorem 3 

Before providing the proof, we first establish two useful 
lemmas. First, we consider the construction of the set x- 

Lemma 10 (Lemma A.3 in [1]). Let 

H^{X: ||X||, < a^rdid 2 , ||X||,, < a} 

and X < 1 be such that r is an integer. Suppose rjx'^< di, 
then we may construct a set x ^ H of size 

1^1 ^ ( 70 ) 

with the following properties: 

1) For all X G X’ each entry has \Xij\ = ay. 

2) For all X^^\X^^'> G x, i f j, 

\\X^^) - > a^X^did2/2. 

Second, we consider about the KL divergence. 

Lemma 11. For x,y > 0, D{x\\y) < {y-xf/y. 

Proof of Lemma 11: First assume x < y. Let z = y — x. 
Then z > 0 and D{x\\x + z) = x log + z. Taking the first 
derivative of this with respect to z, we have -^D{x\\x + z) = 
Thus, by Taylor’s theorem, there is some ^ G [0,z] so that 
D{x\\y) = D{x\\x) + z ■ Since the z^/{x + C) increases 

in we may replace ^ with z and obtain D{x\\y) < . 

For X > y, with the similar argument we may conclude that 
for z = y — X < 0 there is some ^ € [z, 0] so that D{x\\y) = 
D{x\\x) + z ■ Since z < 0 and ^/{x + increases in 
then z^/{x + ^) decreases in We may also replace ^ with z 
and this proves the lemma. ■ 

Next, we will show how Lemma 10 and Lemma 11 imply 
Theorem 3. We will prove the theorem by contradiction. 

Proof of Theorem 3: 

Without loss of generality, assume d 2 > di. We choose e > 0 
such that _ 

= min , (71) 

I 256 V m 1 

where C 2 is an absolute constant that will be be specified later. 
We will next use Lemma 10 to construct a set x, choosing 7 


such that r/ 7 ^ is an integer and 

4v^e 8e 

—^— < 7 < —■ 
a a 

We can make such a choice because 


a r r a r 
6^ ~ X^ ~ 3^ 


a r ui I a I , , 

-- = - > 4q; r > 1 

32e2 64e2 64e2 ^ ^ ^ 

We verify that such a choice for 7 satisfies the the requirements of 

Lemma 10. Indeed, since c ^ ^ and a > 1, we have 7 < | < 1. 

Further, we assume in the theorem that the right-hand side of 

(71) is larger than Cira^/di, which implies r/72 < di for an 

appropriate choice of Ci. 

Let x'a /2 7 whose existence is guaranteed in Lemma 

10, with this choice of 7 and with a/2 instead of a. Then we 
can construct x by defining 

X = ^X' + (1 “ Idixds : € Xa/2,7} 5 

where ldixd 2 denotes an (ii-by-ci2 matrix of all ones. Note that 
X has the same size as Xq /2 7’ satisfies (70). x also has 

the same bound on pairwise distances 

(72) 

for any two matrices X^'^fX^^^ G x- Define a' = (1 — 7)0, 
then every entry of X G x has Xij G {a, a'}. Since we assume 
r > 4 in theorem statement, for any X G x^ we have that for 
some X' G x' , 


|X|U = ||X' + a(l-|) UxdJ 


< -^s/rdpk + as/didf < as/rdfdf. 

Since the 7 we choose is less than 1/2, we have that cf is 
greater than a 12. Therefore, from the assumption that j3 < a/2, 
we conclude that x C 5. 

Now suppose for the sake of a contradiction that there exists 
an algorithm such that for any X G S, when given access to 
the measurements on Dq. returns X such that 

.^\\X-X\\l<e^ (73) 

U1U2 

with probability at least 1/4. We will imagine running this 
algorithm on a matrix X chosen uniformly at random from x- 
Let 

X* = argmin \\Z — X\\f. 
z^x 

By the same argument as that in [1], we can claim that X* = X 
as long as (73) holds. Indeed, for any X' G x with X' 7^ X, 
from (72) and (73), we have that 

||X' - > ||X' - X\\f - |!X - 

At the same time, since € x is a candidate for X*, we have 
that 

\\X*-X\\F<\\X-X\\F<s/^2e. 
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Thus, if (73) holds, then ||7f* — X\\f < \\X' — X\\f for any 
X' G X with X' 7^ X, and hence we must have X* = X. 


Appendix C 

Proofs of Lemma 3 and Lemma 4 


Using the assumption that (73) holds with probability at least 
1/4, we have that 

¥{X*^X)<^. (74) 

We will show that this probability must in fact be large, 
generating our contradiction. 


By a variant of Fano’s inequality in [53], we have 


p(a:* 7^ a:) > 1 - 


log Ixl 


(i,j)eOo 

and the maximum is taken over all pairs of different matrices 
and AT^^l in x- For any such pairs X^^\x‘'^'> G x. we 
know that D{x[^'^\\xf^) is either 0, D{a\\a'), or D{a'\\a) for 
(i, j) G |(ii] X |(i2l- Define an upper bound on the KL divergence 
quantities 

D= max D(A:l'=l||A:l'h. 

X('=)5<^X(0 

By the assumption that |f2o| = m, using Lemma 11 and the fact 
that a' < a, we have 

^ ^ ^ 64me^ 


Combining (74) and (75), we have that 

] <1-¥{X ^X*)< 

4 ^ ^ ^ log 1^1 


< 167^ 


< 1024e^ 


We now show that for appropriate values of Cq and C 2 , this 
leads to a contradiction. Suppose 64?Tte^ < a', then with (76), 
we have that 

4 - aVda ’ 

which together with (71) implies that a^rd 2 < 32. If we set 
Cq > 32, then this would lead to a contradiction. Next, suppose 
64me^ > a', then (76) simplifies to 

1 2. ( 128me^ \ 

4 ^ ® ^ ((l-7)a3rd2 j ' 

Since 1 — 7 > 1/2, we have 

2 I rd2 

1024 V m 

Setting C 2 < 1/1024 in (71) leads to a contradiction. Therefore, 
(73) must be incorrect with probability at least 3/4, which proves 
the theorem. 


Lemma 12. If f is a closed convex function satisfying Lipschitz 
condition (31), then for any X,Y G S, the following inequality 
holds: 

f{Y) < f{X) + {XfiX),Y- X) + ^\\Y- X\\l. 

Proof: Let Z = Y — X, then we have 
fiY) = f{X) + {Xf{X),Z) 

+ f {VfiX + tV)-Xf{X),Z) dt 
Jo 

< f{X) + {Xf{X),Z) 

+ [ \\fiX + tV)-Xf{X)\\F\\Z\\F dt 

< f{X) + {Vf{X),Z)+ [ Lt\\Z\\j,dt 

Jo 

= f{X) + {Vf{X),Y- X) + ^\\Y- X\\l, 

where we use Taylor expansion with integral remainder in the 
first line, the fact that dual norm of Frobenius norm is itself in 
the second line and Lipschitz condition in the third line. ■ 

In the following, proofs for Lemma 3 and Lemma 4 use 
results from [41]. 

Proof of Lemma 3: As is well known, proximal mapping 
of a y G S associated with a closed convex function h is given 
by 

P^oXtf^iY) = argnnn ■ h(X) + ^11^ - Y\\l^ , 

where f > 0 is a multiplier. In our case, h{P) = Is{P)- Define 
for each P G S that 

Gt(P)4i(P-prox,, (P-tXfiP))), 

then we can know by the characterization of subgradient that 

Gt{P)-Xf{P)Gdh{P), ill) 

where dh{P) is the subdifferential of h at P. Noticing that 
P — tGt{P) G S, then from Lemma 12 we have 

fiP - tGtiP)) < fiP) - {XfiP),tGt{P)) + |||Gt(P)|||, 

(78) 

for all 0 < f < 1/L. Define that g{P) = f{P) + h{P). 
Combining (77) and (78) and using the fact that / and h are 
convex functions, we have for any Z G S and 0 < t < 1/L, 

giP-tGtiP)) < g{Z) + {Gt{P),P-Z)-*-\\GtiP)\\l, (79) 

which is an analog to inequality (3.3) in [41]. Taking Z = M 
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Taking t = 1 jL, we have 


and P = Xk in (79), then we have for any fc > 0, 
g{Xk+i) - g(M) < {Gt{Xk),X, - M) - IwCt^Wl g(X,) - ^(M) < ■ 

2t ’ Since Xk € S for any k > 0 and M € 5, we have that 

h{Xk) = 0 for any fc > 0 and h{M) = 0, which completes the 
where we use the fact that (P, P) = ||P|||'. By taking Z = Xk proof. 

and P = Xk in (79) we know that g{Xk+i) < g{Xk) for any * 

k >0. Thus, taking t = 1/P, we have. 


. k—1 

g{Xk) - g{M) < V X! (^(^i+i) - ffW) 

z^O 

j k— 1 

^ ^ E (ii^* - ^Wf - ii^*+i - ^Wf) < 

z=0 


L\\Xo-M\\j, 

2k 

( 81 ) 


Since Xk € S for any k > 0 and M G 5, we have that 
h{Xk) = 0 for any fc > 0 and h{M) = 0, which completes the 
proof. 


Proof of Lemma 4: The definitions and notations in the 
proof of Lemma 3 are also valid in the proof of Lemma 4. 

Define that If = Xq and for any fc > 1, 

Ofe — , , 1 Vk — Xk-i H- {Xk — Xk-i) ■ 

k + I ttk 

For any 0 < t < 1/P, noticing that 

Xk = Zk-i — tGt{Zk-i), 
then we can rewrite 14 as 

14 = 14_i - —Gt{Zk-i). 
ak 

Taking Z = Xk-i and Z = M in (79) and making convex 
combination we have 

g{Xk) < (1 - ak)g{Xk-i) + akg{M) 

+ (kk{Gt{Zk-i), Vk-i — M) — -\\Gt{Zk-i)\\‘p 

= (1 - ak)giXk-i) + akg{M) 

+ f^(\\Vk-i-M\\l-\\Vk-M\\l)- 

(82) 

After rearranging terms, we have 

\{g{Xk)-g{M)) + h\Vk-WF< 

?-n ^ \ ^ 

-^{giXk-i) - g{M)) + -\\Vk-i - Mill. 

Q^k 

Notice that (1 — ak)l{a\) < l/{a\_fj for any fc > 1. Applying 
inequality (83) recursively, we have, 

^(g(X0-5(M)) + ^||Lfc-M|||< ^||Xo-M|||. (84) 
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